SHORT VIDEO INTRODUCTION

Classification

  • Classification is the most basic form of data analysis.
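    • e.g., a loan application (repay on time, repay late, or default) or a credit card transaction (normal or fraudulent)
  • A common task is to examine data where the classification is
    • unknown or
    • will occur in the future
  • Similar data where the classification is known are used to develop rules, which are then applied to the data whose classification is unknown.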
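
A minimal sketch of this idea in Python, assuming a made-up set of past loan applicants (income and debt in thousands) whose outcome is known. The data, the `classify` helper, and the choice of a 1-nearest-neighbor rule are illustrative assumptions; many other classification methods exist.

```python
import math

# Hypothetical labeled records: (income, debt) in thousands -> known outcome.
labeled = [
    ((60, 5), "repaid"),
    ((75, 10), "repaid"),
    ((30, 25), "defaulted"),
    ((25, 30), "defaulted"),
]

def classify(applicant):
    """Assign the class of the closest labeled record (1-nearest neighbor)."""
    nearest = min(labeled, key=lambda rec: math.dist(rec[0], applicant))
    return nearest[1]

# New applicants whose outcome is not yet known.
print(classify((65, 8)))
print(classify((28, 27)))
```

The labeled records play the role of the data with known classification, and `classify` is the rule that is applied to new cases.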

Prediction

  • Prediction is similar to classification, except that the goal is to predict the value of a numerical (continuous) variable (e.g., amount of purchase) rather than a class (e.g., purchaser or nonpurchaser).

Association Rules and Recommendation Systems

  • What goes with what
  • Association rules, or affinity analysis, are designed to find general association patterns between items in large databases.
    • Grocery stores: product placement, weekly promotional offers, bundling products.
    • Hospital databases: which symptom is followed by which other symptom, to help predict future symptoms for returning patients.
    • Online recommendation systems: collaborative filtering.
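
To make "what goes with what" concrete, the sketch below counts item pairs in a handful of made-up grocery baskets and computes support and confidence, two measures commonly used to score association rules. The `transactions` data and the helper names are assumptions for illustration only.

```python
from collections import Counter
from itertools import combinations

# Hypothetical market baskets (one set of items per transaction).
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "eggs"},
    {"bread", "milk", "eggs"},
]

# How often each item, and each unordered pair of items, appears.
item_counts = Counter(item for t in transactions for item in t)
pair_counts = Counter(
    pair for t in transactions for pair in combinations(sorted(t), 2)
)

def support(a, b):
    """Share of all transactions that contain both items."""
    return pair_counts[tuple(sorted((a, b)))] / len(transactions)

def confidence(a, b):
    """Among transactions containing a, the share that also contain b."""
    return pair_counts[tuple(sorted((a, b)))] / item_counts[a]

print(support("bread", "milk"))       # bread and milk together, over all baskets
print(confidence("butter", "bread"))  # how often a butter basket also has bread
```

A rule such as "butter implies bread" is interesting when both its support and its confidence are high enough for the application at hand.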

Collaborative filtering

Collaborative filtering is a method that uses individual users’ preferences and tastes, as revealed by their historic purchases, ratings, browsing, or any other measurable behavior indicative of preference, together with other users’ history.

Association rules vs Collaborative filtering

In contrast to association rules that generate rules general to an entire population, collaborative filtering generates “what goes with what” at the individual user level. Hence, collaborative filtering is used in many recommendation systems that aim to deliver personalized recommendations to users with a wide range of preferences.
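
A rough sketch of user-level recommendation, assuming a tiny made-up ratings table. The `ratings` data, the `cosine` and `recommend` helpers, and the use of cosine similarity over co-rated items are illustrative assumptions, not the only way to build such a system.

```python
import math

# Hypothetical ratings (1-5) by user; items a user has not rated are absent.
ratings = {
    "ann":  {"A": 5, "B": 4, "C": 1},
    "bob":  {"A": 5, "B": 5, "D": 4},
    "cara": {"A": 1, "C": 5, "E": 4},
}

def cosine(u, v):
    """Cosine similarity computed over the items both users rated."""
    shared = set(u) & set(v)
    if not shared:
        return 0.0
    dot = sum(u[i] * v[i] for i in shared)
    norm_u = math.sqrt(sum(u[i] ** 2 for i in shared))
    norm_v = math.sqrt(sum(v[i] ** 2 for i in shared))
    return dot / (norm_u * norm_v)

def recommend(user):
    """Suggest items the most similar other user rated but this user has not."""
    others = [name for name in ratings if name != user]
    nearest = max(others, key=lambda name: cosine(ratings[user], ratings[name]))
    return sorted(set(ratings[nearest]) - set(ratings[user]))

print(recommend("ann"))
```

Here the recommendation for one user depends on that user's own history and on the most similar other user, rather than on a rule shared by the whole population. With only one co-rated item the similarity is trivially 1, so practical systems usually require more overlap.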

Predictive Analytics

  • Classification, prediction, and to some extent, association rules and collaborative filtering constitute the analytical methods employed in predictive analytics.
  • The term predictive analytics is sometimes used to also include data pattern identification methods such as clustering.

Data Reduction and Dimension Reduction

  • The performance of data mining algorithms is often improved
    • when the number of variables is limited, and
    • when large numbers of records can be grouped into homogeneous groups.
  • For example,
    • rather than dealing with thousands of product types, an analyst might wish to group them into a smaller number of groups and build separate models for each group.
    • Or a marketer might want to classify customers into different “personas,” and must therefore group customers into homogeneous groups to define the personas.
  • This process of consolidating a large number of records (or cases) into a smaller set is termed data reduction. Methods for reducing the number of cases are often called clustering.
  • Reducing the number of variables is typically called dimension reduction.
    • Dimension reduction is a common initial step before deploying data mining methods, intended to improve predictive power, manageability, and interpretability.
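
As a rough illustration of grouping records into homogeneous groups, the sketch below runs a bare-bones k-means procedure on a single made-up variable. The `spend` values, the starting centers, and the `kmeans_1d` helper are assumptions for illustration; k-means is only one of several clustering methods.

```python
# Hypothetical annual spend per customer (one numerical variable).
spend = [12, 15, 14, 80, 85, 90, 300, 310]

def kmeans_1d(values, centers, iterations=10):
    """Plain k-means on one variable, starting from the given centers."""
    for _ in range(iterations):
        groups = [[] for _ in centers]
        for v in values:
            # Assign each record to its closest center.
            idx = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            groups[idx].append(v)
        # Move each center to the mean of its group (keep it if the group is empty).
        centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
    return groups, centers

groups, centers = kmeans_1d(spend, centers=[0, 100, 200])
print(groups)
print(centers)
```

Eight records are reduced to three groups, each of which could then be summarized by its center or modeled separately.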

Data Exploration and Visualization

  • Exploration is aimed at
    • understanding the global landscape of the data, and
    • detecting unusual values.
  • Exploration is used for data cleaning and manipulation as well as for visual discovery and hypothesis generation.
  • Methods for exploring data include looking at various data aggregations and summaries, both numerically and graphically.
  • This includes looking at each variable separately as well as looking at relationships among variables.
  • The purpose is to discover patterns and exceptions.
  • Data Visualization or Visual Analytics - Exploration by creating charts and dashboards
  • For numerical variables,
    • we use histograms and boxplots
      • to learn about the distribution of their values,
      • to detect outliers (extreme observations), and
      • to find other information that is relevant to the analysis task.
  • For categorical variables,
    • we use bar charts.
  • For pairs of numerical variables,
    • we use scatter plots
      • to learn about possible relationships and the type of relationship, and
      • to detect outliers.
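
The numerical side of exploration can be sketched with Python's standard `statistics` module. The `amounts` data are made up, and the 1.5 × IQR cutoff is the usual boxplot rule of thumb, used here as an assumption rather than a fixed definition of an outlier.

```python
import statistics

# Hypothetical purchase amounts with one suspicious value.
amounts = [20, 22, 25, 27, 30, 31, 33, 35, 250]

# Quartiles summarize the distribution, much as a boxplot does.
q1, median, q3 = statistics.quantiles(amounts, n=4)
iqr = q3 - q1

# Flag values lying more than 1.5 * IQR beyond the quartiles.
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [x for x in amounts if x < low or x > high]

print("mean:", statistics.mean(amounts), "median:", median)
print("outliers:", outliers)
```

Comparing the mean with the median already hints at the extreme value, which the outlier rule then isolates.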

Visualization can be greatly enhanced by adding features such as color and interactive navigation.

Supervised and Unsupervised Learning

Supervised Learning Algorithms

  • Supervised learning algorithms are those used in classification and prediction.
    • We must have data available in which the value of the outcome of interest (e.g., purchase or no purchase) is known; such data are called “labeled data.”
    • These training data are the data from which the classification or prediction algorithm “learns,” or is “trained,” about the relationship between predictor variables and the outcome variable.
    • Once the algorithm has learned from the training data, it is then applied to another sample of labeled data (the validation data) where the outcome is known but initially hidden, to see how well it does in comparison to other models.
    • If many different models are being tried out, it is prudent to set aside a third sample with known outcomes (the test data) and use it with the finally selected model to estimate how well that model will perform.
    • The model can then be used to classify or predict the outcome of interest in new cases where the outcome is unknown.
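
The three samples described above can be produced with a simple random split. This is a sketch only: the `partition` helper, the 60/20/20 proportions, and the fixed seed are assumptions chosen for illustration.

```python
import random

def partition(records, train=0.6, valid=0.2, seed=1):
    """Shuffle records and split them into training, validation, and test sets."""
    shuffled = records[:]                  # leave the caller's list untouched
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    n_train = round(len(shuffled) * train)
    n_valid = round(len(shuffled) * valid)
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_valid],
        shuffled[n_train + n_valid:],      # the remainder becomes the test set
    )

records = list(range(100))                 # stand-in for 100 labeled cases
train_set, valid_set, test_set = partition(records)
print(len(train_set), len(valid_set), len(test_set))
```

Every labeled record lands in exactly one sample, so the validation and test sets stay unseen while the model is being fit.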

Linear Regression as Supervised Machine Learning Algorithm

  • Simple linear regression is an example of a supervised learning algorithm (although rarely called that in the introductory statistics course where you probably first encountered it).
    • The Y variable is the (known) outcome variable and the X variable is a predictor variable.
    • A regression line is drawn to minimize the sum of squared deviations between the actual Y values and the values predicted by this line.
    • The regression line can now be used to predict Y values for new values of X for which we do not know the Y value.
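
A minimal sketch of fitting and using such a line with the standard least-squares formulas. The `xs` and `ys` values and the helper names are made up for illustration.

```python
# Hypothetical training data: X is the predictor, Y the known outcome.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

def fit_line(xs, ys):
    """Least-squares slope and intercept for simple linear regression."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

slope, intercept = fit_line(xs, ys)

def predict(x):
    """Predicted Y for a new X whose Y value is unknown."""
    return intercept + slope * x

print(slope, intercept, predict(6))
```

The slope and intercept are "learned" from the labeled pairs, and `predict` applies the fitted line to a new X value.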

Unsupervised Learning Algorithms

  • Unsupervised learning algorithms are those used where there is no outcome variable to predict or classify.
  • Hence, there is no “learning” from cases where such an outcome variable is known.
  • Association rules, dimension reduction methods, and clustering techniques are all unsupervised learning methods.
  • Supervised and unsupervised methods are sometimes used in conjunction.

Note

For example, unsupervised clustering methods are used to separate loan applicants into several risk-level groups. Then, supervised algorithms are applied separately to each risk-level group for predicting propensity of loan default.

Data Mining Steps

  1. Develop an understanding of the purpose of the data mining project.
  2. Obtain the dataset to be used in the analysis.
  3. Explore, clean, and preprocess the data.
  4. Reduce the data dimension, if necessary.
  5. Determine the data mining task.
  6. Partition the data (for supervised tasks).
  7. Choose the data mining techniques to be used.
  8. Use algorithms to perform the task.
  9. Interpret the results of the algorithms.
  10. Deploy the model.

SEMMA

The foregoing steps encompass the steps in SEMMA, a methodology developed by the software company SAS:

  • Sample: Take a sample from the dataset; partition into training, validation, and test datasets.
  • Explore: Examine the dataset statistically and graphically.
  • Modify: Transform the variables and impute missing values.
  • Model: Fit predictive models (e.g., regression tree, neural network).
  • Assess: Compare models using a validation dataset.

IBM SPSS Modeler (previously SPSS-Clementine) has a similar methodology, termed CRISP-DM (CRoss-Industry Standard Process for Data Mining). All these frameworks include the same main steps involved in predictive modeling.

Other Models

  • KDD Model: Knowledge Discovery in Databases (KDD) is a systematic process that seeks to identify valid, novel, potentially useful, and ultimately understandable patterns from large amounts of data. In simpler terms, it’s about transforming raw data into valuable knowledge.

  • CRISP-DM: CRISP-DM stands for Cross-Industry Standard Process for Data Mining. It is a cyclical process that provides a structured approach to planning, organizing, and implementing a data mining project. The process consists of six major phases: Business Understanding, Data Understanding, Data Preparation, Modeling, Evaluation, and Deployment.

More to Read

Assignment

  • What to do
  • Requirement
    • PDF format
    • The file name should include your student ID and name
      • stuID_name_title.pdf (e.g. 1111111_ChungilChae_SelfIntroduction.pdf)
  • Due date
    • by DATE 11:59PM
    • NO LATE SUBMISSION ALLOWED!!!!

Reference

Shmueli, G., Bruce, P. C., Yahav, I., Patel, N. R., & Lichtendahl Jr., K. C. (2017). Data Mining for Business Analytics: Concepts, Techniques, and Applications in R. John Wiley & Sons.